ABSTRACT:

The present invention relates to a speech recognition system of the type which comprises
storage means (10, 11) for storing selected parameters for each of a plurality of words in a
vocabulary to be used for recognition of an input item of speech, comparison means (42) for 
comparing parameters of each unknown word in an input item of speech with the stored 
parameters, and indication means (12, 46) responsive to the result of the comparison operation 
for indicating which of the plurality of vocabulary words most closely resembles each unknown 
input word. According to the invention the speech recognition system is characterised in that 
the stored parameters comprise for each vocabulary word a set of labels each representing a 
feature of the vocabulary word occurring at a respective segmentation point in the vocabulary 
word and the probability of the feature associated with each label occurring at a segmentation 
point in a word. Further, the comparison means compares the stored sets of parameters with a 
set of labels for each unknown input word each representing a feature of the unknown input 
word occurring at a respective segmentation point in the unknown input word.



SPEECH RECOGNITION SYSTEM

The present invention relates to a speech recognition system employing a probabilistic technique and,
more particularly, to a speech recognition system wherein speech recognition may be performed
conveniently without deteriorating the recognition accuracy.

As a probabilistic technique for recognising items of speech, there is known a technique using
probabilistic Markov models. A Markov model is a finite state device having a plurality of states and the
ability to undergo transitions between the states. For each transition the probability of occurrence is defined.
Each transition results in the generation of a label and the probability of each of the labels being generated
is also defined. The occurrence of a sequence of transitions results in the generation of a string of labels.
For example, such a probabilistic model can be provided for each word of an item of speech and its
probability parameters can be established by training. During a speech recognition operation, a label string
obtained from an unknown input item of speech is matched with each probabilistic model, and the word
associated with the probabilistic model having the highest probability of generating the same label string is
determined as a recognition result. Such a technique is described, for example, in an article by F. Jelinek,
"Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, 1976, pp. 532 -
556.

Speech recognition operation using Markov models, however, requires a great amount of training data 
for establishing the probability parameters by training and also a significant amount of calculating time for 
training. 

Other techniques in the prior art include the following:

(1) Article by T. Kaneko, et al., "Large Vocabulary Isolated Word Recognition with Linear and DP
Matching," Proc. 1983 Spring Conference of Acoustical Society of Japan, March 1983, pp. 151 - 152.

(2) Article by T. Kaneko, et al., "A Hierarchical Decision Approach to Large Vocabulary Discrete
Utterance Recognition," IEEE Trans. on ASSP, Vol. ASSP-31, No. 5, October 1983.

(3) Article by H. Fujisaki, et al., "High-Speed Processing and Speaker Adaptation in Automatic
Recognition of Spoken Words," Trans. of the Committee on Speech Recognition, The Acoustical Society
of Japan, S80-19, June 1980, pp. 148 - 155.

(4) Article by D. K. Burton, et al., "A Generalization of Isolated Word Recognition Using Vector
Quantization," ICASSP 83, pp. 1021 - 1024.

These articles disclose dividing a word into blocks along a time axis. However, they describe nothing
about obtaining label output probabilities in each of the blocks and performing probabilistic speech
recognition based on the label output probabilities in each of the blocks.

The object of the present invention is to provide an improved speech recognition system employing a 
probabilistic technique. 

The present invention relates to a speech recognition system of the type which comprises storage
means for storing selected parameters for each of a plurality of words in a vocabulary to be used for
recognition of an input item of speech, comparison means for comparing parameters of each unknown word 
in an input item of speech with the stored parameters, and indication means responsive to the result of the 
comparison operation for indicating which of the vocabulary words most closely resembles each unknown 
input word. 

According to the invention the recognition system is characterised in that the stored parameters
comprise for each vocabulary word a set of labels each representing a feature of the vocabulary word
occurring at a respective segmentation point in the vocabulary word, and the probability of the feature
associated with each label occurring at a segmentation point in a word. Further, the comparison means
compares the stored sets of parameters with a set of labels for each unknown input word each representing
a feature of the unknown input word occurring at a respective segmentation point in the unknown input
word.

In order that the invention may be more readily understood an embodiment will now be described with 
reference to the accompanying drawings, in which: 

Fig. 1 is a block diagram illustrating a speech recognition system in accordance with the present
invention,

Figs. 2A and 2B are flow charts for explaining the operation of a training unit included in the system
illustrated in Fig. 1,

Fig. 3 is a flow chart for explaining the operation of a recognition unit 9 included in the system
illustrated in Fig. 1, and






0 241 183 



Figs. 4, 5, 6, and 7 are diagrams for explaining the operation of certain of the functions illustrated in
the flow charts in Figs. 2A and 2B.

In Fig. 1, illustrating a speech recognition system according to the invention as a whole, items of
speech are supplied to an analog/digital (A/D) converter 3 through a microphone 1 and an amplifier 2. The
items of speech can be training speech data or unknown speech data. The A/D converter 3 converts
the item of speech into digital data by repeatedly sampling the items of speech at a frequency of 8 kHz.
The digital data is supplied to a feature value extraction unit 4 to be converted into feature values by using
LPC analysis. A new feature value is generated every 14 msec and is supplied to a labelling unit 5. The
labelling unit 5 performs labelling with reference to a prototype dictionary 6. A label alphabet {f_i} and
prototypes of feature values corresponding thereto are stored in the prototype dictionary 6, and the label f_i
having a prototype which is nearest to each input feature value is determined and produced as an output
from the labelling unit 5. The number of elements of the label alphabet is 32, for example, and a prototype
of a label may be obtained by sampling, at random, feature values in an item of speech spoken for 20 sec.
Each label f_i from the labelling unit 5 is supplied either to a training unit 8 or to a recognition unit 9
through a switching means 7. The input terminal 7c of the switching means 7 is connected either to one
output terminal 7a associated with the training unit 8 during training or to another output terminal 7b
associated with the recognition unit 9 during recognition.
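
The labelling operation described above can be sketched as follows. This is only an illustrative sketch: it assumes that feature values are plain numeric vectors and that "nearest" means the smallest squared Euclidean distance, whereas the description does not name the distance measure. The function names `nearest_label` and `label_string` are hypothetical.

```python
from typing import List, Sequence


def nearest_label(feature: Sequence[float],
                  prototypes: Sequence[Sequence[float]]) -> int:
    """Return the index i of the prototype nearest to one feature value.

    Assumption: squared Euclidean distance is used as the nearness measure.
    """
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(prototypes)),
               key=lambda i: squared_distance(feature, prototypes[i]))


def label_string(features: Sequence[Sequence[float]],
                 prototypes: Sequence[Sequence[float]]) -> List[int]:
    """Convert a sequence of feature values into a string of label numbers."""
    return [nearest_label(f, prototypes) for f in features]
```

In the embodiment the prototype dictionary would hold 32 prototypes, so each returned number would lie between 0 and 31.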

The training unit 8 processes the label string obtained from items of speech representing training
speech data and establishes a preselection table 10 and a probability table 11. The preselection table 10
stores the maximum length L(j) and the minimum length l(j) for the words in the vocabulary for use in a
subsequent recognition operation. The probability table 11 stores the probability p(i, j, k) of each of the
labels f_i occurring in each of blocks b_jk obtained by equally dividing a word w_j in a vocabulary for
recognition. In fact, for convenience of calculations, the value of log p is stored in the probability table 11,
instead of the value of the probability p itself.

The recognition unit 9 processes a label string obtained from an item of speech of an unknown word by
referring to the preselection table 10 and the probability table 11, and performs a recognition operation in
two stages, to be stated later, to obtain a recognition result. The recognition result is displayed on a CRT
12, for example.

The components in block 13 shown with a one-dot chain line may be implemented using software in a
personal computer, e.g., a PC XT manufactured by International Business Machines Corporation. These
components may alternatively be implemented using hardware by adopting a configuration consisting of the
blocks shown with solid lines within block 13 shown with the one-dot chain line. These blocks correspond
respectively to the functions of the software, which will be explained later in detail in the explanation of
steps corresponding thereto with reference to Figs. 2A, 2B and 3. For ease of understanding, the blocks
shown with solid lines in Fig. 1 are illustrated with the same numbers as those of the steps corresponding
thereto shown in Figs. 2A, 2B and 3.

The components in block 14 shown with a one-dot chain line may be implemented by a signal 
processing board added to a personal computer. 

Training of the recognition system will now be explained with reference to Figs. 2A and 2B. The
system, which is for unspecified speakers, performs training based on items of speech spoken by a
plurality of different training speakers. The speakers sequentially input training speech data. In a particular
embodiment, a speaker inputs a plurality of items of speech, for example three items of speech, for each of
the words w_j in the vocabulary to be used for recognition.

In training, a histogram h(i, j, k) for each label f_i in the training speech data is obtained first in each of
the blocks b_jk in the word w_j. Fig. 2A illustrates a procedure for generating the histograms h(i, j, k). In Fig.
2A, at the beginning, the maximum word length L(j), the minimum word length l(j), and j for each of the
words w_j are initialised (Step 15). They are set to L(j) = −∞, l(j) = +∞, and j = 0, respectively. Then,
the CRT 12 (Fig. 1) displays an instruction to the speaker to speak the word w_j three times (Step 16), and
the speaker responds thereto. The A/D conversion, feature value extraction, and labelling are sequentially
performed on the items of speech (Steps 17-19). Then, the maximum word length L(j) and the minimum
word length l(j) are updated, if necessary (Step 20). In the event the longest one of these three items of
speech is longer than the maximum word length L(j), the value is set to a new maximum word length L(j).
Similarly, in the event the shortest one of these three items of speech is shorter than the minimum word
length l(j), the value is set to a new minimum word length l(j).
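
The update of the word length limits in Step 20 may be sketched as below. The sketch assumes that the length of an item of speech is measured as its number of labels; the function name `update_length_range` is hypothetical and not taken from the description.

```python
import math
from typing import Sequence, Tuple


def update_length_range(max_len: float, min_len: float,
                        utterance_lengths: Sequence[int]) -> Tuple[float, float]:
    """Update L(j) and l(j) from the lengths of the items of speech just spoken.

    The limits start at -infinity and +infinity, as in the initialisation
    step, so the first set of utterances always replaces both of them.
    """
    longest = max(utterance_lengths)
    shortest = min(utterance_lengths)
    return max(max_len, longest), min(min_len, shortest)


# Example: initial limits, followed by two speakers saying the word three times.
L_j, l_j = -math.inf, math.inf
L_j, l_j = update_length_range(L_j, l_j, [42, 37, 45])
L_j, l_j = update_length_range(L_j, l_j, [40, 48, 39])
```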

Next, normalisation of the word length and a block segmentation will be performed for each of the items
of speech (Steps 21 and 22). In the normalisation of the word length, the number of labels included in one
word is set to a predetermined number N_f (= N_0 × N_b, where N_0 is a positive integer and N_b is the
number of blocks b_jk), so that the block segmentation can be performed easily. The block segmentation
may be performed by using a unit smaller than a label. In that case, however, the calculation of histograms
will be more complicated. In a particular example, the normalisation is performed by setting the number of
blocks, N_b, to 8 and the positive integer N_0 to 10, so that one word includes 80 labels. This is illustrated in
Fig. 4. The example illustrated in Fig. 4 shows a case where a word before the normalisation of the word
length includes 90 labels. As seen from Fig. 4, some of the labels existing before the normalisation
operation may often be skipped. In a particular example, a label $f(t)$ at a time $t$ after the normalisation
operation (t = 0 - 79; the time units are intervals at which labels are produced) is equal to a label
$\bar{f}(\bar{t})$ at a time $\bar{t}$ before the normalisation operation, assuming
$\bar{t} = \lfloor (t \times 90)/80 + 0.5 \rfloor$, where $\lfloor a \rfloor$ indicates
that the figures of a below the decimal point should be omitted. The above formula may typically be
illustrated as in Fig. 5. Generally, the formula may be expressed as
$\bar{t} = \lfloor (t \times \bar{N}_f)/N_f + 0.5 \rfloor$, where
$N_f$ is the number of labels after the normalisation operation and $\bar{N}_f$ is the number of labels before the
normalisation. In Fig. 4, $\bar{N}_f = 90$ and $\bar{N}_f > N_f$, although $\bar{N}_f \le N_f$ is also possible.
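
The normalisation formula can be tried out with the short sketch below. The function name `normalise_length` is hypothetical, and the rounding is done in integer arithmetic, which gives the same result as adding 0.5 and dropping the fraction. One detail is an assumption added here and not stated in the description: the source index is clamped to the last available label, because for a word less than half the target length the formula can point one position past the end.

```python
from typing import List, Sequence


def normalise_length(labels: Sequence[int], n_after: int = 80) -> List[int]:
    """Resample a label string to n_after labels.

    For each time t after normalisation, the label at time
    floor(t * n_before / n_after + 0.5) before normalisation is copied.
    """
    n_before = len(labels)
    if n_before == 0:
        raise ValueError("cannot normalise an empty label string")
    result = []
    for t in range(n_after):
        # Integer form of floor(t * n_before / n_after + 0.5).
        t_before = (2 * t * n_before + n_after) // (2 * n_after)
        # Assumption: clamp so that very short words stay within range.
        t_before = min(t_before, n_before - 1)
        result.append(labels[t_before])
    return result
```

With a 90-label word, time 4 after normalisation maps to time 5 before it, so the label at time 4 is skipped, as the description of Fig. 4 indicates.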

In the block segmentation operation, each of the items of speech after the normalisation operation is
equally divided into blocks b_jk as illustrated in Fig. 6.
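
A minimal sketch of the equal division into blocks is given below, assuming the label string has already been normalised to a length that is a multiple of the number of blocks. The function name `split_into_blocks` is hypothetical.

```python
from typing import List, Sequence


def split_into_blocks(labels: Sequence[int], n_blocks: int = 8) -> List[List[int]]:
    """Divide a normalised label string equally into n_blocks blocks."""
    if len(labels) % n_blocks != 0:
        raise ValueError("label string length must be a multiple of n_blocks")
    size = len(labels) // n_blocks
    return [list(labels[k * size:(k + 1) * size]) for k in range(n_blocks)]
```

For the example of 80 labels and 8 blocks, each block holds 10 consecutive labels.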

These Steps 16 through 23 are performed for all of the words w_j in the vocabulary to be used for
recognition (Steps 24 and 25). The procedure of generating the histograms illustrated in Fig. 2A is shown
for one speaker. By performing this procedure for a plurality of different speakers, it is possible to generate
histograms h(i, j, k) which are not biased to any particular speaker.

After having generated the histograms h(i, j, k) which are not biased to any particular speaker as stated
above, the histograms are normalised and the probability p(i, j, k) of a feature having a label f_i occurring in a
block b_jk in a word w_j is calculated as illustrated in Fig. 2B (Step 26). This probability p(i, j, k) is obtained
according to the following formula.

$$p(i, j, k) = \frac{h(i, j, k)}{\sum_{i} h(i, j, k)}$$
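
The normalisation of the histograms, together with the storing of log p instead of p, might be sketched as follows. The nested-list layout `hist[j][k][i]` (word, block, label) and the function name `log_probability_table` are assumptions made for this illustration. Labels that never occurred in a block are given a log probability of minus infinity here, since no smoothing is applied in this sketch.

```python
import math
from typing import List


def log_probability_table(hist: List[List[List[int]]]) -> List[List[List[float]]]:
    """Turn counts hist[j][k][i] into log p(i, j, k).

    Each block's counts are divided by the sum of the counts over all
    labels i in that block of that word.
    """
    table = []
    for word_hist in hist:
        word_rows = []
        for block_counts in word_hist:
            total = sum(block_counts)
            word_rows.append([
                math.log(count / total) if count > 0 else -math.inf
                for count in block_counts
            ])
        table.append(word_rows)
    return table
```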



The block segmentation and the calculation of histograms in Steps 22 and 23, respectively, illustrated in
Fig. 2A may be performed as illustrated in Fig. 7, for example. Fig. 7 shows a case where the number of
blocks N_b is 8 and the number of labels f_i in the block b_jk is 10. In Fig. 7, c_1 and c_2 indicate the values of
counters, each of which is set to 0 at the beginning (Step 27). The value of c_1 is incremented by one each
time a label occurs (Step 29), and is reset to 0 when the counter has reached 10 (Step 31). The value of c_2
is incremented by one each time the value of c_1 is reset (Step 31). With the end of each of the blocks
b_jk and the end of each of the words being detected in Steps 30 and 32, respectively, the histograms
h(i(10c_2 + c_1), j, c_2) are incremented by one every time t = 10c_2 + c_1. Here i(t) indicates the number of a label
at the time t (t = 0 - 79; the time units are intervals at which labels are produced).
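
The two-counter procedure of Fig. 7 may be sketched as below. The counter names `c1` (position within the block) and `c2` (block number) are a reading of the counters in the text, and the `hist[j][k][i]` layout and the function name `accumulate_histogram` are assumptions made for this illustration. The label string is assumed to be already normalised to exactly the number of blocks times the number of labels per block.

```python
from typing import List, Sequence


def accumulate_histogram(hist: List[List[List[int]]], j: int,
                         labels: Sequence[int],
                         labels_per_block: int = 10) -> None:
    """Add one normalised utterance of word j to the histogram in place."""
    c1 = 0  # counts labels within the current block
    c2 = 0  # counts completed blocks, i.e. the current block number
    for label in labels:
        # Here t = labels_per_block * c2 + c1 and label = i(t).
        hist[j][c2][label] += 1
        c1 += 1
        if c1 == labels_per_block:  # end of block detected
            c1 = 0
            c2 += 1


# Example: one word, 8 blocks, a 32-label alphabet, 80 labels in the utterance.
example_hist = [[[0] * 32 for _ in range(8)]]
accumulate_histogram(example_hist, 0, [t % 32 for t in range(80)])
```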

Next, referring to Fig. 3, an explanation will be made as to how the speech recognition system
illustrated in Fig. 1 is used for the recognition of an unknown item of speech.

In Fig. 3, when the data of an unknown word x is input (Step 33), the A/D conversion, feature value
extraction, and labelling are sequentially performed on the data as already described (Steps 34, 35, and
36). Then, the length of the unknown word x is determined (Step 37) and used in the subsequent
preselection Step 40. The length of the unknown word x is normalised in the same manner as in Step 21
illustrated in Fig. 2A (Step 38).

In the preselection Step 40, it is determined whether or not a stored word w_j satisfies the following formula
in connection with an unknown word x, by referring to the preselection table 10 (Fig. 1).

$$l(j) \cdot (1 - \Delta) < \mathrm{Length}(x) < L(j) \cdot (1 + \Delta)$$

where Length(x) denotes the length of the unknown word and Δ is a small value, for example 0.2. If
this formula is not satisfied, the probability is specified as −∞ so that the stored word w_j is omitted
from the candidates for the recognition result (Step 43). However, if the formula is satisfied, after the
unknown word x has been divided into the blocks b_jk in the same manner as in Step 22 illustrated in Fig.
2A, the probability is calculated (Step 42). The probability LH(j) of the stored word w_j being the unknown
word x may be obtained according to the following formula.

$$LH(j) = \sum_{t=0}^{T} \log p(i(t), j, k)$$

Here i(t) is the number of the label of the unknown word at time t, k is the index of the block b_jk
containing time t, and T is the last time index of the normalised word (79 in the example above).
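
The summation of log probabilities for one stored word might be sketched as follows. The table layout `logp_word[k][i]` (block, label) and the function name `log_likelihood` are assumptions made for this illustration; the block index for time t is taken as t divided by the number of labels per block, with the fraction dropped.

```python
import math
from typing import Sequence


def log_likelihood(labels: Sequence[int],
                   logp_word: Sequence[Sequence[float]],
                   labels_per_block: int = 10) -> float:
    """Sum log p(i(t), j, k) over all times t of a normalised label string."""
    return sum(logp_word[t // labels_per_block][label]
               for t, label in enumerate(labels))


# Tiny example: 2 blocks of 2 labels each, and an alphabet of 2 labels.
example_table = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.25), math.log(0.75)]]
example_score = log_likelihood([0, 1, 1, 1], example_table, labels_per_block=2)
```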

All the Steps 40 through 43 are performed for all of the words w_j in the input item of speech to be
recognised (Steps 39, 44, and 45) and the probabilities LH(j) of all of the stored words w_j are obtained.
Then, the stored word having the highest probability LH(j) is output as the recognition result (Step 46).
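
The two-stage recognition operation, preselection by word length followed by selection of the highest probability, could be sketched as follows. The data layouts (`length_table[j]` holding the minimum and maximum lengths, `logp_table[j][k][i]` holding log probabilities) and the function names are assumptions made for this illustration. The length test is written with strict inequalities, following the formula as it appears in the text.

```python
import math
from typing import Optional, Sequence, Tuple


def passes_preselection(length: float, min_len: float, max_len: float,
                        delta: float = 0.2) -> bool:
    """Check l(j)(1 - delta) < Length(x) < L(j)(1 + delta)."""
    return min_len * (1 - delta) < length < max_len * (1 + delta)


def recognise(labels: Sequence[int], raw_length: int,
              length_table: Sequence[Tuple[float, float]],
              logp_table: Sequence[Sequence[Sequence[float]]],
              labels_per_block: int = 10,
              delta: float = 0.2) -> Tuple[Optional[int], float]:
    """Return (index of best stored word, its log probability).

    labels is the normalised label string of the unknown word, and
    raw_length is its length before normalisation. Words failing the
    preselection keep a score of -infinity and so are never selected.
    """
    best_word: Optional[int] = None
    best_score = -math.inf
    for j, (min_len, max_len) in enumerate(length_table):
        if not passes_preselection(raw_length, min_len, max_len, delta):
            continue
        score = sum(logp_table[j][t // labels_per_block][label]
                    for t, label in enumerate(labels))
        if score > best_score:
            best_word, best_score = j, score
    return best_word, best_score
```

If no stored word passes the preselection, this sketch returns `None` as the word index, which a caller would treat as a rejection.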

It should be understood that a speech recognition system according to the present invention is not 
limited to the above described embodiment, but various changes in form and details may be made therein 
without departing from the spirit and scope of the invention. For example, while in the above embodiment 
the recognition system has been implemented by software in a personal computer, it can, of course, be 
implemented by hardware. 

Further, while the speech recognition system described has been applied to speech recognition for 
unspecified speakers, such as used in banking systems, subway information systems and the like, it may 
also be applied to systems for specified speakers. 

Further, smoothing may be performed in the speech recognition system described above in order to 
improve the recognition accuracy. For example, in the event a label output probability is 0, it may be
replaced with a value of the order of $\epsilon = 10^{-7}$, or the histograms may be recalculated in consideration of
confusions between labels. 
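
The first smoothing option, replacing zero probabilities with a small value, might look like the sketch below. The function name `smoothed_log_probabilities` is hypothetical, and the sketch makes no attempt to renormalise the probabilities after the replacement, which the description does not discuss.

```python
import math
from typing import List, Sequence


def smoothed_log_probabilities(counts: Sequence[int],
                               epsilon: float = 1e-7) -> List[float]:
    """Return log probabilities for one block, with zeros replaced by epsilon.

    counts holds the histogram values of every label in one block of one
    word. A label that never occurred gets log(epsilon) rather than
    minus infinity, so a single unseen label no longer rules a word out.
    """
    total = sum(counts)
    return [math.log(c / total) if c > 0 else math.log(epsilon)
            for c in counts]
```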

As explained, in the speech recognition system described above label output probabilities can be
expressed quite simply. Therefore, the recognition system can be trained conveniently and requires fewer
calculations during use for speech recognition. Further, since errors due to fluctuations in time can be
absorbed by adopting the probabilistic expressions, recognition errors can be suppressed.



Claims 

1. A speech recognition system comprising

storage means (10, 11) for storing selected parameters for each of a plurality of words in a vocabulary to 
be used for recognition of an input item of speech, 

comparison means (42) for comparing parameters of each unknown word in an input item of speech with 
said stored parameters, and indication means (12, 46) responsive to the result of said comparison 
operation for indicating which of said plurality of vocabulary words most closely resembles each unknown 
input word, 

characterised in that 

said stored parameters comprise for each vocabulary word 

a set of labels each representing a feature of said vocabulary word occurring at a respective 
segmentation point in said vocabulary word, and the probability of the feature associated with each label 
occurring at a segmentation point in a word, 

and in that 

said comparison means comprises 

means for comparing said stored sets of parameters with a set of labels for each unknown input word 
each representing a feature of said unknown input word occurring at a respective segmentation point in said 
unknown input word. 

2. A speech recognition system as claimed in claim 1 characterised in that said comparison means 
comprises normalising means (37) for normalising the length of each unknown word. 

3. A speech recognition system as claimed in either of the preceding claims characterised in that said
comparison means comprises segmentation means (41) for segmenting each unknown word.

4. A speech recognition system as claimed in any one of the preceding claims characterised in that it
comprises

means (14) for receiving known words, and

means (8) for generating selected parameters for each received known word for storing in said storage
means (10, 11).

5. A method of speech recognition system comprising 

storing selected parameters for each of a plurality of words in a vocabulary to be used for recognition of 
an input item of speech, 

comparing parameters of each unknown word in an input item of speech with said stored parameters,

and

indicating, in response to the result of said comparison operation, which of said plurality of vocabulary
words most closely resembles each unknown input word,

characterised in that

said stored parameters comprise for each vocabulary word a set of labels each representing a feature of
said vocabulary word occurring at a respective segmentation point in said vocabulary word, and the
probability of the feature associated with each label occurring at a segmentation point in a word,

and in that

said comparison operation compares said stored sets of parameters with
a set of labels for each unknown input word each representing a feature of said unknown input word
occurring at a respective segmentation point in said unknown input word.